Genome-wide investigations reveal the population structure and selection signatures of Nigerian cattle adaptation in the sub-Saharan tropics

Background Cattle are considered to be the most desirable livestock by small scale farmers. In Africa, although comprehensive genomic studies have been carried out on cattle, the genetic variations in indigenous cattle from Nigeria have not been fully explored. In this study, genome-wide analysis based on genotyping-by-sequencing (GBS) of 193 Nigerian cattle was used to reveal new insights on the history of West African cattle and their adaptation to the tropical African environment, particularly in sub-Saharan region. Results The GBS data were evaluated against whole-genome sequencing (WGS) data and high rate of variant concordance between the two platforms was evident with high correlated genetic distance matrices genotyped by both methods suggestive of the reliability of GBS applicability in population genetics. The genetic structure of Nigerian cattle was observed to be homogenous and unique from other African cattle populations. Selection analysis for the genomic regions harboring imprints of adaptation revealed genes associated with immune responses, growth and reproduction, efficiency of feeds utilization, and heat tolerance. Our findings depict potential convergent adaptation between African cattle, dogs and humans with adaptive genes SPRY2 and ITGB1BP1 possibly involved in common physiological activities. Conclusion The study presents unique genetic patterns of Nigerian cattle which provide new insights on the history of cattle in West Africa based on their population structure and the possibility of parallel adaptation between African cattle, dogs and humans in Africa which require further investigations. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08512-w.

The performance of GBS at variant level comparatively to WGS. The statistical estimates for the rate of variant concordance, sensitivity, specificity and ti/tv ratio between GBS and WGS SNP calls are indicated.   of the 268 bovine samples based on our data and previous data used in this study [20][21][22][23].
Colours represent cattle populations from different geographical regions as described in Supplementary Tables S1 and 2  This is probably because of the differences in genotyping platforms between WGS and its counterpart GBS data which is associated with missing information due to its low sequencing coverage. Therefore, we recommend that in population genomics studies, datasets generated from different platforms should not be used especially in genome-wide investigations, or else the datasets of both the target population and the reference samples should be consistently genotyped from the beginning of the study prior to downstream analyses.   reduced nucleotide diversity after lifting over the reference genome. This could imply that the process of lifting over between versions of reference genomes may lose some bit of the genomic content. As in our case we observed that 3% of the genomic content was lost leaving 97% of the total initial genomic content. Our study therefore recommends that calling of the variants should be performed using the same version of reference genome from the very beginning of the analyses.   This assessment is important to give insights on the accuracy and quality of GBS SNP calls for accurate and robust conclusions.

Cattle sample information
Our analysis for the evaluation of GBS data was carried out using five samples that were randomly selected from a set of 193 cattle samples from Nigeria. These samples were uniquely sequenced and genotyped by both GBS pipeline procedures and WGS for comparison (Additional file 1: Figure S1).

SNP discovery and genotyping
These five Nigerian samples were genotyped separately by two different SNP discovering platforms: the GBS SNP discovering platform [1] and the WGS SNP discovering platform. The pair-end reads generated by each respective sequencing library were first mapped separately against the bovine reference genome (UMD 3.1) using BWA mem [2] prior to scoring of SNPs and genotyping. Marking and removal of duplicates were done using the Picard tool. Both datasets (GBS and WGS) were subjected to GATK v3.8 SNP detection pipeline and used UnifiedGenotyper tool to call SNPs. The detected raw variants were subjected to hard filtering criteria according to GATK Best Practices recommendations [3][4][5]. The filtering criteria include Further, only 20% of missing information was tolerated in both datasets using VCFtools [6] generating high quality biallelic SNPs of > 95% call rate. To analyze the genotype data, the three states of genotypes were coded as heterozygotes (RA), homozygous (RR) which are reference based and homozygous variants (AA).
The following statistics were used to compare the performance and concordance rates between the GBS and WGS SNP scores: 1) Bi-allelic variants concordance rate and 2) genotype concordance rate 3) Sensitivity and Specificity and 4) The transition to transversion (ti/tv) ratio (Additional file 1: Figure S1) and 5) the correlation test between genetic distances computed by GBS and WGS for SNPs typed by both methods. Further, we used several terminologies to evaluate the extent of genotype agreement between GBS and WGS. For instance, we considered genotypes that match in common to be termed as "genotype matches", but those called differently between the two call sets were termed "genotype mismatches". The rate of allele concordance was extrapolated only by alleles that were called the same in both datasets such that alleles that are called homozygous in one dataset but heterozygous in another were considered mismatch alleles sharing only a single allele in common. All of the statistics were computed by using tools and modules in GATK for both variant and genotype concordance evaluations.
Nonetheless, the concordance rates by genotypes called per each sample were computed in two different scenarios (Additional file 1: Figure S2 and Additional file 1: Table S15): Scenario 1.
All genotypes (All Geno) -in this scenario all genotypes such as those regarded as of "no genotype calls" and "unavailable genotype calls" were still included in concordance analysis; Scenario 2. Filtered genotypes (Filtered) -in this second scenario, we excluded and filtered out all genotypes called as "no genotype calls" and "unavailable genotype calls", as such only genotypes of consistent states in both sets were considered.

Notes 2. Extended results on comparison of GBS and WGS
Apart from determining the rate of concordance based on variants, we also evaluated the efficacy of GBS by comparing its genotypes with respect to those found in WGS. Similarly, only the filtered genotypes in scenario 2 (Additional file 1: Figure S2; Additional file 1: Table S15) were used to determine the concordance rates at the genotype level in which the highest rate of concordance was observed in calling heterozygous with an equivalent of 92.8% concordance (Additional file 1: Table S16 and Additional file 1: Figure S11). On the other hand, although discordances were observed they were still relatively low as compared to the rate of concordances observed (Additional file 1: Table S16). The high rate of discordance/genotype mismatch is in accordance with the fact that most low sequencing coverage SNP panels have a tendency to underscore heterozygous as homozygous because the latter are considered to be easily genotyped [7,8].
To further validate the accuracy and performance of GBS, we estimated and compared some statistical metrics such as sensitivity and specificity of GBS and WGS SNP calls. We found that the sensitivity of GBS to WGS calls was 1.8% with 100% specificity (Additional file 1: Figure   S1). The former rate extrapolates that GBS platform was able to uncover about 2% of the true gold standard variants which were 99.99% concordant. On the other hand, GBS has the power of completely ignoring false positives that may have resulted from sequencing artifacts or reads alignment errors due to its 100% specificity observed. Nonetheless, the ti/tv ratio for GBS was shown to be 2.03 for all known variants but was very low at novel sites (0.8) (Additional file 1: Figure S1). However, the ti/tv for concordant variants and for the WGS calls was reasonably high at ratios of 2.2 and 2.25 respectively. These two ti/tv ratios are similar indicating the validity and high accuracy of GBS calls which further conformed its applicability in downstream genomic analyses.
We lastly measured the accuracy of GBS to WGS using some population genetics parameters such as genetic diversity and principal component analysis (PCA) (Additional file 1: Figure S12).
We first explored the distribution of the polymorphic SNPs (usually that minor allele frequency of a particular allele in the population above 1%) for both datasets by observing their minor allele frequency (MAF) distribution. The GBS and WGS data of the same five individuals were pruned by using PLINK v1.9 software and obtained 187,171 and 4,921,499 total number of variants respectively falling within the category of minor alleles (0.0 ≤ MAF < 0.5). The variants within this category for GBS data behave by skewing more to the right (Additional file 1: Figure S13a) compared to WGS data (Additional file 1: Figure S13b). The density of SNP variants between 0.0 -0.1 is seemingly higher in GBS data compared to WGS. Nonetheless, the amount of polymorphic information captured by GBS is observed to decrease disproportionally with the increase in MAF particularly between 0.1 -0.5 compared to WGS, which could probably indicate that GBS generally underestimates the frequency with regards to WGS. PCA was applied to analyze the spatial distribution of the five individuals in both datasets (Additional file 1: Figure S12a-c). We observed similar pattern of clustering of the five individuals in both datasets, which further raises the level of confidence that GBS data can be applied in downstream genetic analyses. The genetic diversity was also inferred based on their overlapping genomic regions as shown in Additional file 1: Figure S12d. It is shown that the genetic diversity captured by GBS was slightly low compared to WGS probably due to the overall low MAF depicted by GBS. Generally, these parameters have highlighted the differences observed in the allele frequencies estimated by the two different genotyping platforms, and probably the difference is due to artefact brought about by the low sequencing coverage of GBS (~5X). The coverage of GBS is far too low compared to that of WGS which is averagely~10X (Additional file 2: Table S14) Overall, we show that GBS approach is generally applicable in downstream cattle genomic analyses due to its high concordance with the WGS data that were generated using the same cattle samples based on the variant calling information. The genotypes assessment showed that GBS had a~93% rate of concordance for heterozygous calls, referring to having the same heterozygosity calls at the same site in both datasets along the genotype field, despite of some inadequacy in genotype counts (Additional file 1: Table S16) The low proportion (2.6%) of the called heterozygous genotypes by GBS detected in WGS is obviously due to the low sequencing coverage of GBS (Additional file 1: Table S16). Some studies indicate that to generate heterozygous calls in diploid organisms such as cattle requires a minimum amount of sequence coverage reads probably above 5X, below of which may lead to inadequate heterozygous calls [9]. Such cases are mostly observed in low sequencing coverage panels (below 5X) such as that observed in this study. However, studies have shown that this situation could be overcome by selecting the representative individuals of the putative genetic variation or through genotyping a large number of individuals.
Nevertheless, the low sequencing coverage of GBS is not regarded as a limitation factor in acquisition of accurate genotypic information for meaningful biological assessment even if imputation is not applied [8]. According to the transition to transversion ratio, sensitivity and specificity statistical metrics have shown high accuracy of GBS calls for meaningful biological downstream analyses. However, the low value of ti/tv observed for specific novel sites indicate that these sites maybe associated with false positives calls as a result of sequencing errors reflecting low degree of accuracy in these sites.
Because of high concordance rates observed, our study concludes that GBS generates enough polymorphic information that could be used in cattle genomic studies (for example in population genetics) regardless of its low sequencing read depth (a factor for a reduced overall cost of sequencing per sample) compared to WGS which is very expensive due to its high sequencing reads coverage. This discrepancy however, can be minimized by genotyping a large number of individuals so as to call enough variants especially those located at rare loci [10]. Usually, low sequence coverage datasets contain minimal levels of polymorphic information (Additional file 1: Figure S13a) as compared to when more individuals are included for genotyping (Additional file 1: Figure S17). Not only that, but also, the use of imputation helps improve the quality of genotypes called in low sequence coverage datasets [8,11]. We therefore recommend that GBS can be applied in surrogate to WGS, only when involving a large number of individuals, which could be followed by imputation procedures, provided stringent filtering parameters are adhered prior to conducting subsequent downstream analyses.   Note: *wt~with respect to